[Figure: histograms of Block.0.query, Block.3.query, and Block.6.query for (a) Full-Precision and (b) Q-ViT.]
FIGURE 2.3
The histogram of query values q (shaded) along with the PDF curve of the Gaussian distribution N(μ, σ²) [195], for three selected layers in DeiT-T and the 4-bit fully quantized DeiT-T (baseline). μ and σ² are the statistical mean and variance of the values.
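As a point of reference, per-block query statistics like those visualized in Fig. 2.3 can be collected with forward hooks, as in the short sketch below. The attribute path blocks[i].attn.qkv assumes the common timm DeiT layout, and query_stats is an illustrative helper rather than the implementation used for the figure.

import torch

@torch.no_grad()
def query_stats(model, images, block_ids=(0, 3, 6)):
    """Return {block_id: (mean, variance)} of the query values q."""
    captured = {}

    def make_hook(idx):
        def hook(module, inputs, output):
            # The qkv projection outputs (B, N, 3 * dim); the first third is q.
            dim = output.shape[-1] // 3
            captured[idx] = output[..., :dim].flatten()
        return hook

    handles = [model.blocks[i].attn.qkv.register_forward_hook(make_hook(i))
               for i in block_ids]
    model.eval()
    model(images)
    for h in handles:
        h.remove()
    return {i: (q.mean().item(), q.var().item()) for i, q in captured.items()}

# Usage with hypothetical full-precision and 4-bit DeiT-T models:
# fp_stats = query_stats(fp_deit_tiny, images)
# q4_stats = query_stats(quant_deit_tiny, images)  # e.g. variance 1.6533 vs. 1.2124 in block 0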
For ease of training, the input to the matrix multiplication layers is set to v̂, which is mathematically equivalent to the inference operations described earlier. The input activations and weights are set to 2, 3, 4, or 8 bits for all matrix multiplication layers except the first and the last, which are always kept at 8 bits. This standard practice in quantized networks has been shown to improve performance significantly. All other parameters are represented in FP32. The quantized network is initialized with the weights of a trained full-precision model with a similar architecture and then fine-tuned in the quantized space.
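A minimal sketch of this baseline setup, assuming a symmetric uniform quantizer with a straight-through estimator, is given below; QuantLinear, uniform_quantize, and the selection of the first/last layers by traversal order are illustrative simplifications rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def uniform_quantize(x, bits):
    """Symmetric uniform quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax - 1, qmax) * scale
    # Forward uses the quantized value; backward passes gradients straight to x.
    return x + (q - x).detach()

class QuantLinear(nn.Linear):
    """nn.Linear whose weights and input activations are fake-quantized."""
    def __init__(self, in_features, out_features, bits=4, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.bits = bits

    def forward(self, x):
        w_q = uniform_quantize(self.weight, self.bits)
        x_q = uniform_quantize(x, self.bits)
        return F.linear(x_q, w_q, self.bias)  # bias stays in FP32

def quantize_vit(fp_model, bits=4):
    """Swap Linear layers for QuantLinear, copying the full-precision weights.

    The first and last matrix-multiplication layers are kept at 8 bits;
    everything that is not a Linear layer remains FP32.
    """
    names = [n for n, m in fp_model.named_modules() if isinstance(m, nn.Linear)]
    first, last = names[0], names[-1]
    for name, module in fp_model.named_modules():
        for child_name, child in module.named_children():
            full = f"{name}.{child_name}" if name else child_name
            if isinstance(child, nn.Linear) and not isinstance(child, QuantLinear):
                q = QuantLinear(child.in_features, child.out_features,
                                bits=8 if full in (first, last) else bits,
                                bias=child.bias is not None)
                q.load_state_dict(child.state_dict())  # init from the FP32 model
                setattr(module, child_name, q.to(child.weight.device))
    return fp_model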
2.3 Q-ViT: Accurate and Fully Quantized Low-Bit Vision Transformer
Inspired by their success in natural language processing (NLP), transformer-based models have shown great power in various computer vision (CV) tasks, such as image classification [60] and object detection [31]. Pre-trained on large-scale data, these models usually have a huge number of parameters. For example, ViT-H has 632M parameters, consuming 2528 MB of memory and requiring 162 GFLOPs, which makes inference expensive in both memory and computation. This limits the deployment of these models on resource-constrained platforms. Therefore, compressed transformers are urgently needed for real-world applications.
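As a quick check of these figures, the parameter memory follows directly from 4 bytes per FP32 weight; the 4-bit number below is added only to illustrate the potential saving.

# ViT-H parameter memory: 632M weights at FP32 vs. a hypothetical 4-bit format.
params = 632e6
fp32_mb = params * 4 / 1e6    # 4 bytes per weight  -> 2528.0 MB
int4_mb = params * 0.5 / 1e6  # 0.5 bytes per weight -> 316.0 MB
print(f"FP32: {fp32_mb:.0f} MB, 4-bit: {int4_mb:.0f} MB")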
Quantization-aware training (QAT) [158] methods perform quantization during back-propagation and, in general, incur a much smaller performance drop at higher compression rates. QAT has proven effective for CNN models [159] on CV tasks. However, QAT methods remain largely unexplored for low-bit quantization of vision transformers. Therefore, we first build a fully quantized ViT baseline, a straightforward yet effective solution based on standard techniques. Our study finds that the performance drop of the fully quantized ViT comes from information distortion in the attention mechanism during the forward pass and from ineffective optimization for eliminating the distribution difference through distillation during backward propagation. First, the attention mechanism of ViT is intended to model long-distance dependencies [227, 60]. However, our analysis shows that direct quantization leads to information distortion, i.e., a significant distribution variation of the query module between the quantized ViT and its full-precision counterpart. For example, as shown
in Fig. 2.3, the variance difference is 0.4409 (1.2124 vs. 1.6533) for the first block¹. This
¹ This supports the Gaussian distribution hypothesis [qin2022bibert].